Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets

Many researchers try to understand a biological condition by identifying biomarkers. This is typically done using univariate hypothesis testing over a labeled dataset, declaring a feature to be a biomarker if there is a significant statistical difference between its values for the subjects with different outcomes. However, such sets of proposed biomarkers are often not reproducible – subsequent studies often fail to identify the same sets. Indeed, there is often only a very small overlap between the biomarkers proposed in pairs of related studies that explore the same phenotypes over the same distribution of subjects. This paper first defines the Reproducibility Score for a labeled dataset as a measure (taking values between 0 and 1) of the reproducibility of the results produced by a specified fixed biomarker discovery process for a given distribution of subjects. We then provide ways to reliably estimate this score by defining algorithms that produce an over-bound and an under-bound for this score for a given dataset and biomarker discovery process, for the case of univariate hypothesis testing on dichotomous groups. We confirm that these approximations are meaningful by providing empirical results on a large number of datasets and show that these predictions match known reproducibility results. To encourage others to apply this technique to analyze their biomarker sets, we have also created a publicly available website, https://biomarker.shinyapps.io/BiomarkerReprod/, that produces these Reproducibility Score approximations for any given dataset (with continuous or discrete features and binary class labels).


Introduction
Improved understanding of a disease can lead to better diagnosis and treatment. This often begins by finding "biomarkers", which is a generic term referring to "a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention" [1]. Typically, these are individual features (e.g., expression values of specific genes [2,3]) that follow different distributions (e.g., have different mean values) in diseased patients versus healthy controls. Sometimes, biomedical researchers can identify candidates for biomarkers based on their existing knowledge of the disease etiology and/or cellular pathways. This is done by seeking features that are causally related to the disease (e.g., phenylketonuria is caused by mutations in the single gene PAH [4]) or a symptom of it (e.g., Hemoglobin A1C for monitoring the degree of glucose metabolism in diabetes [5]). This paper, however, focuses on the use of statistical tools to discover and evaluate biomarkers, which is typically based on a dataset-a matrix whose rows each correspond to a subject (say a person) and each column corresponds to a feature (e.g., clinical measure, or the expression value of a gene), and with a final column providing the outcome (e.g., a binary outcome distinguishing case versus control); see Fig 1. These "biomarker discovery studies" (also known as "association studies") then attempt to determine which of the features (columns) differ significantly among distinct outcomes. Two standard examples of such studies are the "Genome Wide Association Studies" (GWASs), over a set of SNPs [6]; and the "Gene Signature Studies", over gene expression values [7]. Typically, this involves first computing a representative statistic for each feature (e.g., for continuous entries, running a t-test based on the mean and variance of the case versus control), and then declaring a feature to be a biomarker if the resulting MCC-corrected p-value is below 0.05 [8], where we use "MCC" (Multiple Comparison Corrections) as a generic term, which includes both False Discovery Rate (FDR) correction, and Family-Wise Error (FWE) Correction. (Section B.I in S1 Appendix discusses some of the subtleties here, especially with respect to features).
In some situations, the researchers then validate these biomarkers using a biological or medical process (e.g., based on knock-out or amplification studies [9,10]). Other studies validate the proposed biomarkers based on existing biological knowledge. A third class of projects  [14], with respect to the group outcome (here "Metastasis" for breast cancer). The circled features, with p<0.05, are (purported) biomarkers. For notation: We will refer to each of the first r columns of the matrix as a "feature"; these are often called "(independent) variables". We refer to the final column as a "outcome"-e.g., case versus control (shown here as Y versus N)-these are often called "labels", "dependent variables", "groups", "phenotypes" or "classes". Finally, we will use "subject" to refer to each row of that matrix; these are sometimes called "instances" or "samples".
https://doi.org/10.1371/journal.pone.0252697.g001 instead use the potential biomarkers to create a computational model-perhaps to learn a classifier [11][12][13]-and then measure that down-stream model (perhaps based on its accuracy on a held-out set) and declare the biomarkers to be useful if that model scores well.
A great many papers, however, simply publish the list of purported biomarkers without providing validation for this set; see Section B.2 in S1 Appendix. This paper focuses specifically on this case. We address this limitation by providing a falsifiable (statistical) claim about such sets of biomarkers, which suggests a way to validate the proposed biomarker sets.
While some biomarkers are causally related to the associated outcome, this can be difficult to establish (often requiring instrumented studies [15]); but fortunately, in many situations, it may be sufficient for the features to be correlated with the phenotype. Here, an ideal biomarker discovery process would identify all-and-only the features that are consistently correlated with the associated disease, in that its presence (or absence or . . .) alone supports that disease. This argues that a proposed biomarker is good if it was reproducible-i.e., that the biomarkers found in one study, would appear in many (ideally, all) future studies that explore this disease.
This has motivated the use of independent test sets to check the validity of the earlier findings. Unfortunately, many papers report this is not the case-i.e., that relatively few biomarkers appear across multiple studies. For example, while the breast cancer studies by van't Veer et al. [2] (resp., Wang et al. [3]) reported signatures with 70 (resp., 76) genes, these two sets had only 3 genes in common. Ein-Dor et al. [16] notes this in another situation: "Only 17 genes appeared in both the list of 456 genes of Sorlie et al. [17] and the 231 genes of van't Veer et al. [2]; merely 2 genes were shared between the sets of Sorlie et al. and Ramaswamy et al. [18]. Such disparity is not limited to breast cancer but characterizes other human disease datasets (Alizadeh et al. [19]) such as schizophrenia (Miklos and Maleszka [20])". Many others [21][22][23] report similar findings. Indeed, Begley and Ellis [24] report that only 6 of 53 published findings in cancer biology could be confirmed; which Wen et al. [25] notes is "a rate approaching an alarmingly low 10% of reproducibility". Moreover, a 2016 Nature survey [26], of over 1500 scientists, found that 70% of researchers have tried but failed to reproduce another scientist's experiments, and 52% thought there was a significant 'crisis' of reproducibility.
There are many possible reasons for this.
1. Each study should consider the same well-defined "distribution" over instances-e.g., over the same distribution of ages and genders, etc. If the study attempts to distinguish case from control, then the two sub-populations should only differ in a single characteristic. Unfortunately, matching cases and controls over all possible features is often not achievable.

A second issue
is with the precise notion of what "reproducible" means. Is it a property of a specific biomarker, or of a set of biomarkers? There is no clear choice for an optimal objective measure. This is especially problematic when dealing with multi-factorial diseases, where the outcomes correspond to a disjunction over many sub-diseases [27]. See also Section B.3 in S1 Appendix.

A final important issue
is the impact of sample size. Many studies have a relatively low number of subjects, which increases the probability of finding both false negatives and false positives.
Our analysis assumes that the researchers have addressed (1) by running carefully designed, well-specified studies. Further, we also assume that there is no uncertainty in the outcomes with respect to its clinical or biological definition. We will provide a precise measure of reproducibility (2), as well as some specific implementations, and show empirically how this measurement varies with sample size (3).
In this paper, we assume that biomarkers are stand-alone features. Note each feature could be a pre-defined combination of single features (e.g., the average expression values of the genes associated with a pre-defined signalling pathway-see gene enrichment [28]) or networks of genes associated with high loadings of principal component and univariate Pearson correlation values (see PC-corr [29]); but we are not considering learning combinations. By a Biomarker Discovery process BD(�), we mean a function that takes as input a labeled data matrix D of n subjects over a set of r features, each labeled with its outcome, and identifies a subset of proposed biomarkers; see Fig 1. We will formally define a Reproducibility Score, RS(D, BD), to quantify the "reproducibility" of the set of proposed biomarkers produced by the biomarker discovery process: viz., the average Jaccard score between these proposed biomarkers BDðDÞ; and those produced by running the same BD process over another comparable dataset drawn from the same distribution where two datasets are comparable if they have the same number of subjects from each outcome. (Section B.3 in S1 Appendix discusses some subtle issues related to "reproducibility".) In order to estimate the reproducibility score in practice, we construct two approximations: an overbound and an underbound. We then provide empirical tests over many datasets, with a focus on t-tests as the main biomarker discovery process. We provide many examples for both microarray data (with continuous values) and SNP data (with discrete values) to provide practical evidence for the effectiveness of these approximations. Researchers can use this framework to estimate the reproducibility of the potential results of their biomarker discover study. A low reproducibility score suggests that these biomarkers may not be accurate, potentially because the dataset used is too small, the dataset is too heterogenous, or the biomarker discovery algorithm is not suitable for the dataset. To help users evaluate the quality of their proposed biomarker sets, we have also produced a publicly available website, https://biomarker. shinyapps.io/BiomarkerReprod/, that, given a labeled dataset, computes these estimates of the Reproducibility Score, with respect to any of a variety of biomarker discovery algorithms.
Outline: Section 2 formally defines the Reproducibility Score (RS) and describes the challenges of estimating this measure. We then define two approximations for RS: an overbound and an underbound. It also describes some of the standard biomarker discovery algorithms. Section 3 describes extensive empirical studies over many datasets-microarray and mRNAseq data (with continuous values) and SNP data (with discrete values), focusing on a standard Biomarker Discovery process BD(�), to confirm the effectiveness of these approximations. Section 4 summarizes some future work and the contributions of this paper. In the Supplementary Information, Section B in S1 Appendix discusses various notes: a glossary of the various technical terms used, how Biomarker Discovery differs from standard (supervised) Machine Learning, different notions of "reproducibility", how a combination of a pair of features might be important for a predictive task, even if neither, by itself, is important (towards explaining why it can be so difficult to find biomarkers); etc. Section C in S1 Appendix presents results from other empirical studies, which explore how the RS varies with the type of MCC correction used (including "none"), the p-value threshold, the size of the dataset and the number of iterations of the approximation algorithms. Finally, the approximations we present are motivated by two heuristics (Heuristics 7 and 9). Section D in S1 Appendix presents arguments that motivate these heuristics, and also provide additional empirical evidence that support them.
We close this section by motivating the need for an objective measure for evaluating the quality of a set of biomarkers (Subsection 1.1), then overviewing some earlier studies that discuss the issue of reproducibility in biomarker discovery and/or provide approaches that could be beneficial when dealing with such problems (Subsection 1.2).

Motivation for evaluating biomarker sets
To motivate the need for evaluating association studies, consider first predictive studies, which use a labeled dataset, like the one shown at the top of Fig 1, to produce a predictive model (perhaps a decision tree, or a linear classifier) that can be used to classify future subjects-here, into one of the two classes: Y or N. In addition to the learned classifier, the researchers will also compute a meaningful estimate of its quality-i.e., of the accuracy (or AUROC, Kappa Score, etc.) of this classifier on an independent hold-out set [30], or the results of k-fold cross-validation over the training sample.
By contrast, many association studies report only a set of purported biomarkers, but provide no falsifiable claim about the accuracy of these biomarkers. Many meta-reviews claim that a set of biomarkers is problematic if they are not reproduced in subsequent studies [16,[31][32][33]. Given that biomarkers should be reproducible, we propose evaluating a biomarker set based on its reproducibility score. An accurate estimate of this score can help in at least the following three ways: 1. Researchers can compare various different "comparable" biomarker discovery algorithms to see which produces the biomarker set that is most reproducible. Here, "comparable" corresponds to the standard practice of only considering discovery tools that impose the same criterion, such as the same p-value, or only considering features that exhibit the same minimum fold-change. This type of analysis may help to determine errors within the biomarker discovery process. Moreover, we will see that MCC-correction, while useful in removing false-positives, can be detrimental to the goal of producing reproducible biomarkers; similarly, there is no reason to insist on p < 0.05 for the statistic test used.

2.
A low reproducibility score suggests that few of the proposed biomarkers will be found in another dataset, and highlights the potential that these proposed biomarkers may not be accurate. This could motivate researchers to consider a dataset that is larger, to focus on a more homogenous population, or perhaps consider another biomarker discovery technique.
3. Finally, there are many meta-reviews [21,34,35] that note the lack of repeatability in many biomarker discovery papers, and question whether the techniques used are to blame. One way to address this concern is to require that each published paper include both the purported set of biomarkers, and also an estimate of its reproducibility score. The same way a prediction study's "5-fold cross validation" accuracy tells the reader how accurate the classification model should be on new data, this biomarker-discovery reproducibility score will inform the reader whether to expect another study, on a similar dataset, will find many of the same biomarkers. Note that we should view the reproducibility score as necessary for considering a proposed model, but not sufficient-i.e., it might rule-out a proposed discovery model, but should not be enough to rule-in a model.
For these reasons, we provide an easy-to-use, publicly available webapp https://biomarker. shinyapps.io/BiomarkerReprod/ that anyone can use to produce meaningful estimates of the reproducibility of a set of biomarkers. (The underlying code is also available, from https:// github.com/amirfrz/BMDA).

Related work
There have been many pairs of studies that have each produced biomarkers for the same disease or condition, but found little or no overlap between the two lists of purported biomarkers. Many papers have discussed this issue-some describing this problem in general [16,33,35], and others exploring specific examples [8,32]. These papers suggest different reasons for the problem, such as the heterogeneous biological variations in some datasets [16,33] or problems in the methods used that may lead to non-reproducible results [36,37].
In particular, Zhang et al. [33] challenge the claim that the non-reproducibility problem in microarray studies is due to poor quality of microarray technology, by showing that inconsistencies occur even between technical replicates of the same dataset. They also show that heterogeneity in cancer pathology would further reduce reproducibility.
Ein-Dor et al. [16] also show the inconsistencies between the results of subsamples of a single dataset, demonstrating that the set of (gene) biomarkers discovered is not unique. They explain that there are many genes correlated with the group outcomes, but the empirical correlations change for different (sub)samples of instances. These two papers motivate our need for tools that can effectively estimate the reproducibility-such as the ones presented here.
Several projects [35][36][37] have attempted to formally analyse this problem. Ein-Dor et al. [35] describe a method, Probably Approximately Correct (PAC) sorting, that estimates the minimum number of instances needed for a desired level of reproducibility. As an example, this worst-case analysis proves that, to guarantee a 50% overlap between different gene lists for breast cancer, each dataset needs to include at least several thousand patients. This suggests poor repeatability results when using small sample sizes, which is consistent with our results for datasets with smaller sample sizes; see Subsection 3, especially Fig 4. The goal of the MicroArray Quality Control (MAQC) project [36] was to address the problems and uncertainties about the microarray technology that were caused by the observation that different studies (of the same phenotype) often found very different biomarkers.
They suggest that the common approach of using just t-test p-values (especially with stringent p-values) can lead to poor reproducibility, which motivated them to consider methods like fold-change ranking with a non-stringent p cutoff, which they demonstrate leads to more reproducible gene sets. In a follow-up, Guo et al. [37] found similar results by using the same procedures for another dataset. However, Klebanov et al. [38] later show that these MAQC project results do not prove that using t-tests is necessarily unsuitable-i.e., just because another method (here fold-change) can generate more reproducible results, does not mean that it is performing better; as an extreme, the algorithm that declares every gene is a biomarker (think p = 1.0), is completely reproducible. They demonstrate these points by using a set of simulation studies (where they know the "true biomarkers"), and use either t-test or fold-change to propose potential biomarkers. These studies found that the t-test approach performed much better than the fold-change, in terms of recall (sensitivity). These results motivated us to use the ttest approach (rather than fold-change) as our main BD algorithm-which we use for all of our empirical experiments.
Our approximation algorithms use a type of re-sampling to bound reproducibility. Below we summarize several other studies that similarly deal with re-sampling and biomarker discovery. Some studies provide ways to better estimate the true statistical significance, but do not provide a framework for evaluating empirical reproducibility-e.g., Gagno et al. [39] used bootstrap resampling to estimate 95% confidence intervals and p-values for an internal assessment of their findings (related to breast cancer survival), Chitpin et al. [40] proposed a resampling-based method to better estimate the false discovery rate in chromatin immunoprecipitation experiments, and Pavelka et al. [41] proposed a resampling-based hypothesis testing algorithm that provides a control of the false positive rate for identification of differentially expressed genes. Furthermore, Alshawaqfeh et al. [42] and Zhao and Li [43] suggested methods for consistent biomarker detection in high-throughput datasets where evaluation was based on common biomarkers among the two resampled sets-i.e., they are considering the false positives but not false negatives. (By contrast, our use of Jaccard score involves both.) Other studies had different goals-e.g., Ma et al. [44] used resampling in a permutation test, to evaluate the predictive power of the identified gene set based on the accuracy of downstream classification task; recall however that our goal is to evaluate the reproducibility of the biomarkers directly (not downstream). Filosi et al. [45] propose methods for evaluating the stability of reconstructed biological networks in terms of inference variability due to data subsampling; we however are focusing on the reproducibility of the individual biomarkers. Hua et al. [46] conducted a simulation study to compare the ranking performance of several gene set enrichment methods; by contrast, our approach considers the SET of biomarkers, not the ranking, and is over several real-world datasets (not just simulated ones). Note that none of these used re-sampling techniques to bound the expected replicability of the set of biomarkers found by some discovery algorithm, nor to demonstrate the validity of those bounds.

Formal description
As illustrated in Fig 1, a "Biomarker Discovery" algorithm, BD(�), takes as input a dataset D of n subjects, each described by r features F = {f 1 , . . ., f r } and labeled with a binary outcome, and returns a subset F 0 � F of purported biomarkers.
Typically, each f 2 F 0 differs in some significant way between each class. To be more precise, let x j i be the value of the i th feature of the j th subject, and ℓ (j) be the outcome of the j th subject (which is either + or -). Then a class difference means that the set fx j i j ' ðjÞ ¼ þg of values of the i th feature of the diseased individuals is significantly different from the values of that feature over the healthy individuals, fx j i j ' ðjÞ ¼ À g. For simplicity, we will assume that these fx j i g i;j values are either all continuous (such as height, or the expression value of a gene), or all discrete (such as gender, or the genotype of a SNP). Subsection 2.2 below will describe several such biomarker discovery algorithms.
As noted in Eq 1, the Reproducibility Score RS(D, BD) quantifies the "reproducibility" of the set of proposed biomarkers BD(D) corresponding to running the biomarker discovery algorithm over the labeled dataset D. Here, we assume that the values of each feature x j : , for each outcome c, are generated independently from a fixed distribution (i.e., "i.i.d.") Here and in general, we use P (�) to refer to either a probability density for continuous variables, or a probability mass for discrete variables. Note these are just the marginal distributions: we do not assume that the various features are independent from one another.-i.e., this does not necessarily correspond to Naive Bayes [30]. We will viewpð�Þ ¼ ½p i;c ð�Þ� i;c as the matrix of these r × 2 different distributions, and letp ½n þ ;n À � ð�Þ be the distribution for sampling n = n + + n − instances independently from this distribution, where n þ 2 Z þ instances are drawn from the distribution [p 1,+ (�), . . ., p r,+ (�)] associated with positive outcomes, and similarly n À 2 Z þ instances from the distribution [p 1,− (�), . . ., p r,− (�)] associated with negative outcomes. Then for datasets D 0 ; D 00 �p ½n þ ;n À � ð�Þ sampled independently, we define RS � ðpð�Þ; ½n þ ; n À �; BDð�ÞÞ ¼ E½ JðBDðD 0 Þ; BDðD 00 Þ Þ � ð2Þ where the Jaccard score of two sets Of course, Eq 4 suggests the obvious bootstrap sampling algorithm [48]. Empirically, however, we found that it did not perform well (see Section C.1 in S1 Appendix)-motivating the algorithms described in Subsection 2.3.

Biomarker discovery algorithms: BD(�)
We now discuss various approximation algorithms for biomarker discovery. As our goal is to illustrate the reproducibility issues with respect to the standard approach, we focus on that standard approach: where the biomarker discovery is based on independent two-sample ttests, perhaps with some multiple comparison correction. This use of t-tests implicitly assumes that the feature values, for each outcome, are normally distributed. However, it is easy to adapt these algorithms to use other statistical tests, that do not make this distributional assumption.
See also Limitations in Section 4. Recall that we are considering two types of datasets, depending on whether its feature values (the x j i mentioned above) are continuous or discrete. However, for datasets with categorical values-SNPs in our analysis-we use a simple preprocessing step, which precedes all the BD(�) algorithms described here, to convert each categorical value to a real number: here converting each SNP feature, which ranges over the values { AA, Ab, bb }, to the real-values { 0, 1, 2 }, corresponding to the number of minor alleles ("b") in the genotype. This allows us to view each such dataset as one with continuous values.
We assume that the real values of each feature, for each outcome, follows a normal distribution, which might be different for the different outcomes, and so we use an independent twosample t-test for all of our empirical experiments. Recall that the test statistic is given by ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 n þ þ 1 n À s � s p ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ðn þ À 1Þ � s 2 þ þ ðn À À 1Þ � s 2 À n þ þ n À À 2 where n + and n − are the number of instances with positive and negative outcomes, respectively, and with empirical means � X À and � X þ and empirical variances � s 2 þ and � s 2 À .
Note the biomarker discovery process essentially performs a single statistical test for each feature. As the number of features is often large-often tens-of-thousands, or more-many projects sought ways to reduce the chance of false discoveries; a standard way to do this is through some MCC method. We therefore consider biomarker discovery algorithms described as BD t, p, χ (D), where the t in the subscript refers to the 2-sided t-test, the p for the p-value used, and χ to the MCC method. Our canonical example is BD t, 0.05, BH (D), with p = 0.05, and χ = BH to the Benjamini/Hochberg correction [49]. This notation makes it easy to consider many variants -e.g., adjusting the p-value used for the statistical test, whether it is applying another multiple testing correction, or none, etc. See Section C.2 in S1 Appendix for more details.

Algorithms that approximate the reproducibility score
As we have the dataset D with n � [n + , n − ] labeled instances, we can directly compute BD(D). To compute RS(D, BD(�)), however, we also need to produce one (or more) similar datasets D 0 , each with [n + , n − ] subjects drawn from the same (implicit) distributionpð�Þ that generated D (with the same number of positive and negative instances), but which is presumably disjoint from D. While we do not have such D 0 's, and so cannot directly compute the Reproducibility Score, we show below how to compute an overbound and an underbound of RS(D, BD(�)).

Overbound: oRS.
The oRS(D, BD(�), k) procedure produces (an estimate of) an overbound for RS(D, BD(�)), by making it easier for a feature to be selected to be in both purported biomarker sets. In this algorithm, D is our given fixed dataset, BD(�) is the given biomarker discovery algorithm, and k is a parameter to determine the number of trials when computing the overbound. (Here, and below, see the Glossary in Section A in S1 Appendix for a summary of the algorithms and their arguments.) The oRS algorithm first defines a size-2n dataset DD that contains two copies of each subject in D, of course with the same outcome both times. It then randomly partitions this DD into two disjoint size-n datasets D 1 and D 2 , balanced by outcome. Here, each partition is with respect to the list of elements, so it will include duplicates. To insure that the resulting datasets are balanced, oRS first split DD into DD + and DD − , where DD + are the cases and DD − the controls. It then forms D þ 1 by randomly drawing 1/2 of DD + , and D À 1 by randomly drawing 1/2 of DD − , then merges The dataset D 2 is then formed from the remaining subjects of DD not included in D 1 . oRS then runs BD(�) on the datasets D 1 (resp., D 2 ) to produce two sets of biomarkers, and computes the Jaccard score for this pair of biomarker sets: J(BD(D 1 ), BD(D 2 )). It then repeats this double-split-BD-Jaccard process k times, then returns the average of these k values: where each dataset pair [D1 r , D2 r ] is created independently using the above procedure. For each r, as we expect D1 r to overlap with D2 r , it is relatively likely that any D1 r -biomarker will also be a D2 r -biomarker (more likely than if D1 r was disjoint from D2 r ), which means we expect the associated Jaccard score to be higher. This follows from the heuristic that, as two datasets have more common elements, we expect the number of biomarkers common to two datasets, to increase-i.e., if A 1 ; A 2 ; B 1 ; B 2 �p ½n þ ;n � ð�Þ and then we expect JðBDðA 1 Þ; BDðA2ÞÞ will be larger than JðBDðB 1 Þ; BDðB2ÞÞ

PLOS ONE
Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets ceteris paribus. Here, we view each of {A 1 , A 2 , B 1 , B 2 } as a set of n = n + + n − r-dimensional instances. Note this relationship is simply a heuristic to motivate the algorithm-one that we expect to hold in practice. Section D in S1 Appendix provides some basic arguments, and empirical evidence, to support this claim. We close with three quick observations: 1. Expected Overlap: We expect 50% of the instances to be duplicated in any given pair of datasets. See Lemma 2 in Section D.4 in S1 Appendix.

Relation to Bootstrap Samples:
Here, each subject occurs exactly twice in each pair of datasets D1 r and D2 ,r . If we instead used bootstrap sampling (called bRS below), we expect many subjects would occur more often in the pair of datasets. (See Lemma 3 in Section D.4 in S1 Appendix.) Given Heuristic 7, this means we expect bRS's Jaccard score here to be higher than for oRS's; as oRS is already an overbound for the true score (RS), this means bRS would be a worse bound, as bRS � oRS � RS. This is why we use our "doubling approach" oRS rather than bootstrap sampling bRS, as oRS produces values that are smaller, but still remains an overbound, as desired. See Figs 7 and 8 and Section C.1 in S1 Appendix.
3. Relation to RS � : Our oRS approach is clearly related to RS � (Eq 2), as both compute the average Jaccard score of pairs of size-n datasets sampled from a distribution. They differ as (a) RS � uses the true distributionpð�Þ, while oRS uses only the estimate b p D ð�Þ, (b) RS � is the true average while oRS is just the empirical average over k trials, and (c) RS � will draw independent datasets, but the oRS datasets will overlap.

Underbound: uRS.
The uRS(D, BD, k) procedure produces (an estimate of) an underbound for RS(D, BD(�)) by making it harder for a feature to be selected to be in both purported biomarker sets. First, observe that as [n + , n − ] increases (keeping the n + -to-n − ratio fixed, as we consider changing the size of the dataset), we expect the statistical estimates to be

PLOS ONE
Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets more accurate, and in particular, statistical tests for differences between the two classes will be correct more often. Hence, a statistical test will better identify the "true" biomarkers F � from a size-n subset D (n) , versus from a size-n/2 subset D (n/2) . Now consider two size-n datasets D ðnÞ 1 and D ðnÞ 2 , and also two size-n/2 datasets E then we expect JðBDðA 1 Þ; BDðA2ÞÞ will be larger than JðBDðB 1 Þ; BDðB2ÞÞ Fig 4(a) presents empirical evidence, over 5 datasets, supporting this claim -showing that the Jaccard score increases as we increase the size s of the datasets. Section D in S1 Appendix provides some arguments, and additional empirical evidence (over hundreds of simulations), that further support this heuristic.

Empirical study over various datasets
There are now many publicly-available datasets that have been used in association studies.
Here, we use them to . . .
More generally, we can do this for any size-s subset D (s) of D where s � n/2. Here, we need a set of pairs of disjoint outcome-balanced subsets D 0 , D 00 � D where |D 0 | = |D 00 | = s and D 0 \D 00 = {}. For a fixed dataset D, and specified number k 2 Z >0 , we can then plot these c RSð D ðsÞ ; BDð�Þ; kÞ values along with oRS(D (s) , BD, k) and uRS(D (s) , BD, k), as a function of s to see their behaviour; see Fig 4(b), for the Metabric dataset. Our website https://biomarker. shinyapps.io/BiomarkerReprod/ also provides this visualization.
We explored our approximations over 25 different real-world datasets, including 16 microarray datasets and 2 RNAseq datasets with continuous data, and 7 SNP datasets with

PLOS ONE
Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets categorical data (see Table 1). This first set includes 4 of the gene expression datasets discussed in the Zou et al. [8] meta-analysis-each describing metastatic versus non-metastatic breast primary cancer subjects-to see if our method is consistent with their empirical results. We also included 11 other relatively-small gene expression datasets (with 19 to 187 subjects), focusing on human studies that had a binary class outcome from the GEO repository. To explore how our tools scale with size, we also included 3 other relatively large datasets, with 532 to 1654 subjects. As these were survival datasets, we set the binary outcome based on the median survival time (removing any subject that was censored before that median time). In addition to these 4+11+3 = 18 gene expression datasets (with real-valued entries), we also include 7 SNP datasets (with 39 to 164 subjects), with discrete values, also selected from human studies with binary class outcome s. Fig 5 plots the number of features and biomarkers found, using the BD t, 0.05, BH algorithm, for each dataset-both D (n) and D (n/2) .

Results
We ran our suite of methods over the aforementioned 25 datasets, including 16 microarray datasets and 2 mRNAseq datasets, whose feature-values fx j i g (recall each x j i is the expression value of the i-th gene for the j-th subject; we log 2 -transformed the values from the mRNAseq Table 1. Results for all 25 datasets when using all the subjects. This table is sorted by sample size (#subjects)-corresponding to Fig 5. The first 18 entries are Gene Expression datasets (including the 4 " � "ed entries, from the Zou et al. [8] meta-study), and the final 7 are SNP datasets. Reproducibility Scores are shown in the form of mean ± standard deviation. The "(Majority %)" values are the percentage of the subjects in the dataset with the more common outcome-e.g., 53% of the subjects in the GDS968 dataset are labeled "+" (for "long survival time"), and 82% of GSE7390 are labeled "−" (for "Non-Metastatic").

PLOS ONE
Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets datasets) and 7 were SNP datasets with categorical entries, i.e., x j i 2 f0; 1; 2g is the number of minor alleles in the genotype for the i th SNP for the j th subject; see Table 1. Here, we use the standard BD t, 0.05, BH (�) biomarker discovery algorithm; see Section 2.2.
First, to address (U1) and (U2) in Section 2.4, we analyzed the 4 datasets mentioned in the Zou et al. [8] meta-analysis (see the 4 " � " rows of Table 1) and computed the {uRS(D, BD t, 0.05, BH , 50), oRS(D, BD t, 0.05, BH , 50)} values for each dataset D, as well as the actual Jaccard score for biomarkers for each pair of datasets. (We were not able to replicate the results reported for the 5 th dataset from that study, and so we excluded that one dataset from our analysis.) The results, in Fig 6, show that the Jaccard score for each pair is well within the bounds computed by our approximations, for each of the datasets in that pair-that is, the results for 4 × 3 = 12 ordered-pairs of datasets are consistent with our predictions. Note that we verified that our BD t, 0.05, BH algorithm matched the original study by verifying that the PO scores (Eq 15 in Section C.4 in S1 Appendix) matched the ones that were originally published.
To address (U3), we also analyzed the other 14 continuous datasets D, and computed the oRS and uRS values using k = 50 repetitions; see Fig 7[left]. We see that the overbound oRS is consistently larger than the underbound uRS-i.e., uRS � oRS-as claimed by Eqs 11 and 12. Fig 7[right] plots the corresponding values for the D (n/2) datasets, that use only 1/2 of the dataset, using the same BD(�) algorithm and k = 50. It also plots the "true" c RSð D ðn=2Þ ; BDð�Þ; kÞ values for the datasets. Again, we see that oRS � c RS � uRS. Finally, similar to that experiment over the 18 continuous datasets, we examined the 7 discrete datasets and produced the reproducibility scores. Fig 8[left] shows the scores for each of the 7 SNP datasets, demonstrating that oRS � uRS holds for the discrete cases as well. Fig 8  [right] shows the scores for D (n/2) datasets, and performs the same verification. Those figures also allow us to see the Jaccard scores (U1 above) range from essentially 0 to around 0.475 for the D (n/2) datasets. In addition to the plots, Tables 1 and 2 present the relevant values.

PLOS ONE
Section C in S1 Appendix provides the results of many additional empirical studies, showing how the reproduciblity scores change based on which (if any) MCC method is used, the specific p-value used for the t-test, the number of draws k used by the various approximations, and the size of the dataset n.

Recommended use
As suggested above, whenever researchers have identified a possible set of biomarkers from a dataset D and Biomarker Discovery Algorithm BD, we encourage them to apply our analytic technique, and where appropriate, even our specific bounding algorithms (from https://  Empirical results for continuous datasets. Reproducibility scores (mean and standard deviation) for all 18 continuous datasets, both for complete datasets with n subjects (left) and for half-sized with n 2 subjects (right), for k = 50 iterations. The x-axes (for both plots) are sorted by the value of the over-bound for the D (n) datasets. We see, in both, that the over-bound oRS is consistently higher than the under-bound uRS. Moreover, the right plot shows that the "truth" c RS is also between uRS and oRS. (The bRS lines in the plots are based on the Bootstrap Overbound method; see Section C.1 in S1 Appendix). https://doi.org/10.1371/journal.pone.0252697.g007

PLOS ONE
Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets Empirical results for SNP datasets. Reproducibility scores (mean and standard deviation) for 7 SNP datasets, both for complete datasets with n subjects D (n) (left) and for half-sized with n 2 subjects D (n/2) (right), for 50 iterations. The x-axes (for both plots) is sorted by the value of the overbound oRS for the D (n) datasets. We see, in both, that the over-bound oRS is consistently higher than the under-bound uRS. Moreover, the right plot shows that the "truth" c RS is also within the range of oRS and uRS. (The bRS lines in the plots are based on the Bootstrap Overbound method; see Section C.1 in S1 Appendix).
https://doi.org/10.1371/journal.pone.0252697.g008 Table 2. Results for all datasets when using half of the subjects-i.e., D (n/2) . Reproducibility Scores and average number of biomarkers are shown in the form of mean ± standard deviation. (The caption for Table 1 describes the row ordering).

Name
Average #biomarkers uRS % c RS % oRS % biomarker.shinyapps.io/BiomarkerReprod/), to estimate the Reproducibility Score of this proposed set. If those scores (especially the lower bound uRS(D, BD)) are sufficiently high, they can use those biomarkers, confident that they (as a set) are reproducible. But if not, the researchers may want to explore other ways to identify reproducible biomarkers. One obvious approach is to use a larger dataset, with more instances, as we know that RS increases with the size of the dataset |D|; see Heuristic 7 and the analysis in Section D.2 in S1 Appendix. Alternatively (or in addition), recall the reproducibility depends on both the dataset, and the Biomarker Discovery Algorithm BD; perhaps some other BD test,p − val, MCC would work better here?
We could consider modifying all 3 of the components: • While the MCC methods are designed to reduce the false positives, it is not clear whether they improve RS. Indeed, our experiments (Fig C.1 in S1 Appendix) show that RS values reduce with MCC-which points to using BD test,p, None .
• The non-reproducibility problem might be due to the statistical test used. As noted earlier, the t-test implicitly assumes a Gaussian distribution of the features (or the normality of residuals); perhaps another test would work better for some dataset / distribution of instances.
• Finally, they may want to modify the p-value, p, as other values of p may lead to better reproducibility.

Limitations
This paper provides an effective way to estimate the reproducibility of the biomarkers found from a dataset, using a biomarker discovery algorithm. While the message of this paper is very general, the specific analyses here all used the standard discovery algorithm BD t, 0.05, BH , to illustrate that the issues apply to the approaches commonly used. Section C in S1 Appendix explores some other discovery methods, that are also based on t-tests. While this use of t-tests implicitly assumes normality of the feature values, nothing in our general approach relies on this specific test-we could use other tests that make other parametric assumptions. (Note that our analysis appears to work effectively even when dealing with some Bernoulli (non-normal) features.) We anticipate the same approach would hold for other tests, such as Mann-Whitney, Wilcoxon or even multivariate analysis-but this is future work. All of our empirical studies dealt with standard datasets, whose outcomes are binary and whose values were either all real values or all categorical values; none had some of each. Our analytic model considers the overlap of biomarkers found from two datasets, of the same size-e.g., we do not consider how the biomarkers obtained from a size-100 dataset, overlap with those from a size-300 dataset. Finally, the recommendations of the previous subsection suggest another future direction: produce the "BD 0 (D, s) algorithm" that, given a dataset D and a minimum score s > 0, returns the parameters [test, p-val, MCC] for a biomarker discovery algorithm BD test,p − val, MCC that would produce a biomarker set whose Reproducibility Score would be at least s-i.e., we expect that RS(D, BD test,p − val, MCC (�)) � s, suggesting this level of reproducibility for the proposed biomarkers BD test,p − val, MCC (D).

Contributions
This paper has several contributions: (1) It motivates, then provides, a formal definition of reproducibility, that can help researchers evaluate a set of purported biomarkers; (2) It provides a pair of algorithms that can accurately bound this "reproducibility score" for a given dataset and biomarker discovery algorithm; (3) It provides empirical evaluation of these algorithms, over 25 different real-world datasets, to demonstrate that they work effectively; and (4) It introduces a freely-available website https://biomarker.shinyapps.io/BiomarkerReprod/ that runs these algorithms on the dataset entered by a user, and biomarker discovery system, which will allow users to easily evaluate the quality of the biomarker set produced. Given how easy it is to use this tool, we hope that future researchers will automatically use it to quickly evaluate the quality of the purported biomarkers, then include these estimates when they publish their biomarkers. We anticipate this analysis may also lead to new biomarker discovery algorithms, to optimize this reproducibility score.